
npj Digital Medicine

Springer Science and Business Media LLC

Preprints posted in the last 30 days, ranked by how well they match npj Digital Medicine's content profile, based on 97 papers previously published here. The average preprint has a 0.21% match score for this journal, so anything above that is already an above-average fit.

1
Learning Patient-Specific Event Sequence Representations for Clinical Process Analysis

Solyomvari, K.; Antikainen, T.; Moen, H.; Marttinen, P.; Renkonen, R.; Koskinen, M.

2026-03-30 health informatics 10.64898/2026.03.25.26348333 medRxiv
Top 0.1%
61.3%

Healthcare system performance evaluation is constrained by episodic performance indicators and process mining techniques that fail to accommodate the scale, heterogeneity, and temporal complexity of real-world clinical pathways. Electronic health records enable reconstructing patient journeys that capture how care processes unfold across fragmented healthcare services. Here we present ClinicalTAAT, a time-aware transformer that bridges clinical sequence modeling and process mining by integrating contextual and time-varying information to learn interpretable patient-specific representations from inherently sparse, irregular and high-dimensional clinical event sequences. Evaluated on a large pediatric emergency cohort, ClinicalTAAT outperforms existing models in acuity and diagnosis classification, identifies clinically meaningful patient subgroups in a heterogeneous population with distinct acuity, resource utilization and diagnostic patterns, and detects anomalies in individual care trajectories. These findings demonstrate that time-aware transformers can complement existing process mining methodologies and serve as foundation models for clinical process analysis, providing a scalable framework for data-driven healthcare evaluation and optimization.
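
The paper's core idea, conditioning attention on real event times rather than sequence position, can be illustrated with a minimal sketch (my own illustration, not the authors' ClinicalTAAT code): continuous elapsed time replaces the integer index in a standard sinusoidal positional encoding, so irregularly spaced clinical events receive distinct embeddings.

```python
import numpy as np

def time_encoding(elapsed_hours, dim=8, max_scale=10_000.0):
    """Sinusoidal encoding of continuous elapsed time (e.g. hours since
    admission), analogous to transformer positional encodings but driven
    by real timestamps instead of the sequence index."""
    t = np.asarray(elapsed_hours, dtype=float)[:, None]    # (n, 1)
    freqs = max_scale ** (-np.arange(0, dim, 2) / dim)     # (dim/2,)
    angles = t * freqs                                     # (n, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

# Three events at irregular times: 0 h, 1.5 h, and 72 h after admission.
enc = time_encoding([0.0, 1.5, 72.0], dim=8)
print(enc.shape)  # (3, 8)
```

Two events one minute apart and two events three days apart get encodings whose distance reflects the real gap, which a position-index encoding cannot express.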

2
Cognitive AI-Assisted Primary Care Health Delivery: A Pilot Study in Bangladesh

Kabir, R. A.; Williams, M.; Rayhan, N.

2026-04-05 public and global health 10.64898/2026.04.03.26349253 medRxiv
Top 0.1%
58.2%

Research has documented persistent physician workforce shortages globally, with projected shortfalls threatening primary care access in underserved populations. Existing AI applications in healthcare have largely focused on predictive risk-scoring tools that generate probability estimates but do not reduce the time a physician spends completing a patient encounter. A January 2025 study further demonstrated that large language models lack the metacognitive capacity necessary for reliable medical reasoning, i.e., being able to ask appropriate questions in the absence of information to collect patient history and update differential diagnoses. This paper reports on a 2025 pilot deployment of ClinicalAssist in Bangladesh that tested a fundamentally different model: an AI system designed to replicate every step of the clinical workflow. Across 239 unique patients, 277 encounters, and 287 diagnostic opportunities, the system achieved an overall diagnostic accuracy of 94.7%, with chronic disease accuracy of 98.0% and acute care accuracy of 88.9%. These results suggest that cognitive AI has the potential to be a powerful clinical force multiplier if properly integrated in workflow.

3
Digital journaling enables privacy-preserving behavioral phenotyping and real-time risk monitoring at scale

Milham, M.; Low, D.; Erkent, A.; Trabulsi, J.; Kass, M. C.; Vos de Wael, R.; Yenepalli, S.; Wang, Y.; Leyden, M.; Jordan, C.; Salum, G.; Alexander, L.; Schubiner, G.; Hendrix, L.; Koyama, M.; Mears, L.; McAdams, R.; White, C.; Merikangas, K.; Satterthwaite, T. D.; Franco, A.; Klein, A.; Koplewicz, H.; Leventhal, B.; Freund, M.; Kiar, G.

2026-04-08 psychiatry and clinical psychology 10.64898/2026.04.04.26349881 medRxiv
Top 0.1%
52.3%

Digital mental health applications enable high-frequency behavioral monitoring and scalable interventions. Journaling provides a therapeutically grounded and intrinsically engaging activity for many users. AI-based text analysis enables privacy-preserving phenotyping of clinically relevant patterns in naturalistic writing, including emotional distress and behavioral risk (e.g., indicators of intent, planning, or preparatory actions for harm to self or others). We evaluated a mobile journaling platform in an 8-week randomized controlled trial (N = 507) of young adults with mild-to-moderate anxiety and depression symptoms. Journaling produced modest reductions in anxiety relative to controls at the 8-week endpoint and 1-month follow-up (d = 0.16-0.19). Effects were small and did not remain significant after correction for multiple comparisons; complementary Bayesian models nonetheless provided moderate-to-strong directional evidence (90-97%) supporting a modest anxiety reduction. In parallel, behavioral phenotyping analyses showed that high-risk journal entries were more common among younger users (OR = 0.77 per year of age, p = 0.007). Text-based risk signals and self-reported energy exhibited significant circadian variation (e.g., risk probability was highest during late-night and overnight hours). Within-person analyses demonstrated strong short-term persistence in mood and risk states, with calm/relaxed showing the highest persistence and anxious/agitated exhibiting the lowest persistence. High-risk journal entries clustered temporally and were preceded by sustained low valence and energy. Although affective volatility was associated with acute declines within the same affective dimension (pleasantness or energy), it was not associated with escalation to high-risk states. Key behavioral dynamics observed in the trial were replicated in an independent general population dataset (N = 16,630). 
Collectively, these findings demonstrate that privacy-preserving digital journaling can support scalable longitudinal behavioral phenotyping and real-time risk monitoring while providing modest clinical benefit for anxiety symptoms.
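
The reported effect sizes (d = 0.16-0.19) are standardized mean differences; a pooled-standard-deviation Cohen's d is the usual estimator, sketched below as an illustration (the trial's exact statistical model may differ).

```python
import numpy as np

def cohens_d(treatment, control):
    """Cohen's d: difference in group means divided by the pooled SD."""
    t, c = np.asarray(treatment, float), np.asarray(control, float)
    nt, nc = len(t), len(c)
    pooled_var = ((nt - 1) * t.var(ddof=1) + (nc - 1) * c.var(ddof=1)) / (nt + nc - 2)
    return (t.mean() - c.mean()) / np.sqrt(pooled_var)

# Toy symptom-change scores (positive = improvement):
print(round(cohens_d([1, 2, 3, 4], [0, 1, 2, 3]), 3))  # 0.775
```

On this scale, the trial's d of about 0.16-0.19 means the group means differ by under a fifth of a standard deviation, which is why the effect is described as modest.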

4
Patient-Centred Communication in Lung Cancer Screening: A Clinically Focussed Evaluation of a Fine-Tuned Open-Source Model Against a Larger Frontier System

Khanna, S.; Chaudhary, R.; Narula, N.; Lee, R.

2026-04-11 oncology 10.64898/2026.04.10.26350595 medRxiv
Top 0.1%
43.1%

Lung cancer screening saves lives, yet uptake remains suboptimal and inequitable. Personalised communication can improve attendance and reduce anxiety, but scaling such support is a workforce challenge. We fine-tuned Google's Gemma 2 9B using QLoRA on 5,086 synthetic screening conversations and compared it against Google's Gemini 2.5 Flash (a larger frontier model) and an unmodified baseline across 300 multi-turn conversations with 100 patient personas spanning ten clinical categories. Evaluation combined automated natural language processing metrics with independent language model judgement in two complementary modes: structured clinical rubric and simulated patient persona. The fine-tuned model achieved the highest simulated patient experience score (3.71/5 vs 3.65 for the frontier model), recorded zero boundary violations after clinician review of all flagged instances, and led on the four most safety-critical categories. A composite Patient Adaptation Index showed that the fine-tuned model led overall (0.37 vs 0.35 vs 0.35), with its clearest advantage on the two clinically specific components: empathy calibration to patient distress and selective smoking cessation signposting. These findings suggest that targeted fine-tuning of open-source models can yield clinical communication quality comparable to larger proprietary systems, with advantages in safety-critical scenarios and suitability for NHS data governance constraints. Human clinician review of these conversations is ongoing.

5
Multimodal prediction of visual improvement in diabetic macular edema using real-world electronic health records and optical coherence tomography images

Sun, S.; Cai, C. X.; Fan, R.; You, S.; Tran, D.; Rao, P. K.; Suchard, M. A.; Wang, Y.; Lee, C. S.; Lee, A. Y.; Zhang, L.

2026-04-24 health informatics 10.64898/2026.04.23.26351616 medRxiv
Top 0.1%
42.8%

Multimodal learning has the potential to improve clinical prediction by integrating complementary data sources, but the incremental value of imaging beyond structured electronic health record (EHR) data remains unclear in real-world settings. We developed a multimodal survival modeling framework integrating optical coherence tomography (OCT) and EHR data to predict time to visual improvement in patients with diabetic macular edema (DME), and evaluated how different ophthalmic foundation model representations contribute to prognostic performance. In a retrospective cohort of 973 patients (1,450 eyes) receiving anti-vascular endothelial growth factor therapy, we compared multimodal models combining 22,227 EHR variables with 196,402 OCT images, with OCT embeddings derived from three ophthalmic foundation models (RETFound, EyeCLIP, and VisionFM). The EHR-only model showed minimal prognostic discrimination (C-index 0.50 [95% CI, 0.45-0.55]). Incorporating OCT improved performance, with the magnitude of improvement depending on the representation. EHR+RETFound achieved the strongest performance (C-index 0.59 [0.54-0.65]), followed by EHR+EyeCLIP (0.57 [0.52-0.62]) and EHR+VisionFM (0.56 [0.51-0.61]). Multimodal models, particularly EHR+RETFound, demonstrated improved risk stratification with clearer separation of Kaplan-Meier curves. Partial information decomposition revealed that prognostic information was dominated by modality-specific contributions, with OCT and EHR providing largely distinct signals and minimal shared information. The magnitude of OCT-specific contribution varied across foundation models and aligned with observed performance differences. These findings indicate that OCT provides complementary prognostic value beyond structured clinical data, but gains are modest and depend strongly on representation choice. 
Our results highlight both the promise of multimodal modeling for personalized prognosis and the need for rigorous, context-specific evaluation of foundation models in real-world clinical settings.
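
The C-index values above have a direct pairwise reading: the probability that, of two comparable patients, the model ranks the one who improves earlier as higher. A minimal Harrell's C-index sketch (illustrative, not the authors' pipeline):

```python
import numpy as np

def c_index(time, event, score):
    """Harrell's concordance index: fraction of comparable pairs where
    the earlier event gets the higher score (here, a higher score means
    earlier expected visual improvement). Censored cases (event = 0)
    can only serve as the later member of a pair."""
    time, event, score = map(np.asarray, (time, event, score))
    num, den = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue
        for j in range(len(time)):
            if time[i] < time[j]:          # pair is comparable
                den += 1
                if score[i] > score[j]:
                    num += 1.0
                elif score[i] == score[j]:
                    num += 0.5             # tied scores count half
    return num / den

# Perfectly ranked toy cohort (months to improvement, censoring, score):
print(c_index([2, 5, 9], [1, 1, 0], [0.9, 0.5, 0.1]))  # 1.0
```

On this scale 0.50 is random ranking, which is why the EHR-only model's C-index of 0.50 is described as minimal discrimination, and 0.59 as a real but modest gain.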

6
LLM-Driven Target Trial Emulation with Human-in-the-Loop Validation for Randomized Trial: Automated Protocol Extraction and Real-World Outcome Evaluation

Dey, S. K.; Qureshi, A. I.; Shyu, C.-R.

2026-04-13 health informatics 10.64898/2026.04.09.26350523 medRxiv
Top 0.1%
42.3%

Target trial emulation (TTE) enables causal inference from observational data but remains bottlenecked by manual, expert-dependent protocol operationalization. While large language models (LLMs) have advanced clinical knowledge extraction and code generation, their ability to automate end-to-end TTE workflows remains largely unexplored. We present an LLM-driven framework using retrieval-augmented generation to extract the five core TTE design parameters from the Carotid Revascularization and Medical Management for Asymptomatic Carotid Stenosis Trial (CREST-2) protocol and generate executable phenotyping pipelines for real-world EHR data. The performance of the framework was evaluated along two dimensions. First, protocol extraction accuracy was assessed against a gold-standard checklist of trial design components using precision, recall, and F1-score metrics. Second, outcome validity was evaluated through population-level concordance analyses comparing EHR-derived outcomes with published trial endpoints using standardized mean difference, observed-to-expected ratios, confidence interval overlap, and two-proportion z-tests. Further, human-in-the-loop validation assessed the correctness of extracted clinical logic and phenotype definitions. Together, these evaluations demonstrate a structured approach for assessing LLM-driven protocol-to-pipeline translation for scalable real-world evidence generation.
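
Scoring extraction against a gold-standard checklist reduces to set overlap; a sketch with hypothetical component names (the abstract does not enumerate the five parameters):

```python
def extraction_scores(predicted, gold):
    """Precision, recall and F1 for extracted protocol components
    against a gold-standard checklist, treated as label sets."""
    pred, gold = set(predicted), set(gold)
    tp = len(pred & gold)                       # correctly extracted items
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical checklist items, for illustration only:
gold = {"eligibility", "treatment strategies", "outcome",
        "follow-up", "causal contrast"}
pred = {"eligibility", "treatment strategies", "outcome", "grace period"}
print(extraction_scores(pred, gold))  # (0.75, 0.6, ~0.667)
```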

7
The Clinician Model Card: development and evaluation of clinician-centered documentation for AI-based clinical decision support

Agha-Mir-Salim, L.; Frey, N.; Kaiser, Z.; Mosch, L.; Weicken, E.; Freyer, O.; Ma, J.; Mittermaier, M.; Meyer, A.; Gilbert, S.; Muller-Birn, C.; Balzer, F.

2026-04-17 health informatics 10.64898/2026.04.15.26350930 medRxiv
Top 0.1%
41.5%

AI documentation frameworks remain poorly designed for point-of-care use, leaving clinicians without actionable information on how to use clinical AI models when they need it most. We developed the Clinician Model Card, an interactive, clinician-centered documentation tool, and evaluated it in a sequential exploratory mixed-methods study: interviews with 12 physicians informed iterative co-design, evaluated in a national survey of 129 physicians across Germany. The tool was well-received: 84% agreed it should be routinely available, and 66% considered its content relevant to clinical decision-making. Yet comprehensibility of statistical performance metrics remained poor despite targeted interventions: only 32% understood the Validation & Performance section well, and fewer than 54% correctly interpreted AUROC or PPV, with AI literacy as a strong predictor of comprehension. We propose empirically derived design principles for clinician-centered AI documentation. Effective AI transparency requires not only clinician-friendly design and workflow integration, but sustained investment in AI literacy.
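
The comprehension gap around PPV is unsurprising: unlike sensitivity or AUROC, PPV depends on disease prevalence, not just the model. A Bayes-rule sketch makes the point:

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' rule:
    P(disease | positive result)."""
    tp = sensitivity * prevalence              # true-positive mass
    fp = (1 - specificity) * (1 - prevalence)  # false-positive mass
    return tp / (tp + fp)

# The same 90%-sensitive, 90%-specific model at two prevalences:
print(round(ppv(0.9, 0.9, 0.50), 3))  # 0.9
print(round(ppv(0.9, 0.9, 0.01), 3))  # 0.083
```

An identical model drops from a 90% to an 8% chance that a flagged patient is truly diseased when prevalence falls from 50% to 1%, which is exactly the kind of interpretation a point-of-care model card has to convey.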

8
Interpretable AI for Accelerated Video-Based Surgical Skill Assessment: A Highlights-Reel Approach

Lafouti, M.; Feldman, L. S.; Hooshiar, A.

2026-04-20 medical education 10.64898/2026.04.18.26351193 medRxiv
Top 0.1%
29.7%

Background: Manual video-based evaluation of surgical skills can be time-consuming and delays trainee feedback. Artificial intelligence (AI) offers opportunities to automate aspects of assessment while maintaining clinician oversight. We developed an interpretable spatiotemporal model that classifies surgical expertise directly from endoscopic video in standardized training tasks and generates saliency-based "highlights reels" showing the most influential frames. Methods: An RGB pipeline combining InceptionV3 for spatial feature extraction and a gated recurrent unit (GRU) for temporal modeling was trained on the JIGSAWS dataset. The model outputs novice, intermediate, or expert labels. A rolling-window, low-latency evaluation at 30 fps with a stride of 10 frames was used. A motion-augmented variant fused RGB with optical-flow features. Spatial and temporal saliency maps highlighted key decision-making regions. Results: The RGB model achieved 95% accuracy (F1: 92% expert, 86% intermediate, 99% novice). Performance was strongest for novice and expert trials, while intermediate trials showed the lowest recall, consistent with greater ambiguity around the intermediate skill level. Saliency maps consistently emphasized tool-tissue interactions and peaked during technically demanding phases. The optical-flow variant underperformed, approximately 38% accuracy, which may reflect sensitivity to global camera motion and other non-informative motion patterns. Conclusions: This interpretable AI pipeline accurately classifies surgical skill while producing intuitive visual highlights. Future work will refine highlight thresholds and validate on laparoscopic inguinal hernia repair for real-world deployment.
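
The rolling-window, 10-frame-stride inference described above amounts to simple strided indexing. A sketch assuming a 30-frame (one-second at 30 fps) window, since the abstract specifies the stride but not the window length:

```python
def rolling_windows(n_frames, window=30, stride=10):
    """(start, end) frame indices for rolling-window video inference:
    a `window`-frame clip advanced `stride` frames at a time.
    The 30-frame window here is an assumed value for illustration."""
    return [(s, s + window) for s in range(0, n_frames - window + 1, stride)]

# A 3-second clip at 30 fps yields 7 overlapping windows.
ws = rolling_windows(90)
print(len(ws), ws[0], ws[-1])  # 7 (0, 30) (60, 90)
```

Each window gets its own expertise prediction, which is what lets the saliency "highlights reel" localize the most influential moments rather than scoring the video as a whole.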

9
Individualized Forecasting of Headache Attack Risk Using a Continuously Updating Model

Houle, T. T.; Lebowitz, A.; Chtay, I.; Patel, T.; McGeary, D. D.; Turner, D. P.

2026-04-22 neurology 10.64898/2026.04.20.26350119 medRxiv
Top 0.1%
28.9%

Importance: Migraine attacks often occur unpredictably, limiting the ability of individuals to initiate timely preventive or preemptive treatment. Short-term probabilistic forecasting of migraine risk could enable more targeted management strategies. Objective: To externally validate the previously developed Headache Prediction Model (HAPRED-I), evaluate an updated continuously learning model (HAPRED-II), and assess the feasibility and short-term safety of delivering individualized probabilistic migraine forecasts directly to patients. Design, Setting, and Participants: Prospective 8-week cohort study conducted remotely at two academic medical centers in the United States (Massachusetts General Hospital and Wake Forest Health Sciences) between 2015 and 2019. Adults with recurrent migraine or tension-type headache completed twice-daily electronic diaries. A total of 230 participants contributed 23,335 diary entries across 11,862 participant-days of observation. Main Outcomes and Measures: Occurrence of a headache attack within 24 hours following each evening diary entry. Model performance was evaluated using discrimination (area under the receiver operating characteristic curve [AUC]) and calibration. Results: External validation of HAPRED-I demonstrated modest discrimination (AUC, 0.59; 95% CI, 0.57-0.61) and poor calibration, with predicted probabilities consistently exceeding observed headache risk. In contrast, the continuously updating HAPRED-II model demonstrated progressive improvement in predictive performance as participant-specific data accumulated. Discrimination increased from an AUC of 0.59 (95% CI, 0.57-0.61) during the first 14 days to 0.66 (95% CI, 0.63-0.70) after the first month, accompanied by improved calibration across predicted risk levels. Over the study period, 6999 individualized forecasts were delivered directly to participants. No evidence suggested that receipt of forecasts was associated with increasing headache frequency or worsening predicted headache risk trajectories. Conclusions and Relevance: A static migraine forecasting model demonstrated limited transportability to new individuals. In contrast, models that continuously update within individuals may improve predictive accuracy over time and enable real-time delivery of personalized migraine risk forecasts. Further work incorporating richer physiologic and contextual predictors will likely be necessary before such systems can reliably guide clinical treatment decisions.
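
The AUC figures above have a direct probabilistic reading: the chance that a randomly chosen headache day receives a higher forecast than a randomly chosen headache-free day. A minimal Mann-Whitney sketch of that equivalence (illustration only):

```python
import numpy as np

def auc(labels, scores):
    """AUC as the normalized Mann-Whitney U statistic: the probability
    that a random positive (headache day) outranks a random negative,
    with ties counted as half."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.8, 0.4, 0.6, 0.2]))  # 0.75
```

By this reading, the static model's 0.59 means its forecasts rank a headache day above a headache-free day only 59% of the time, barely better than a coin flip, while the updating model climbs to 66% within a month.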

10
Dissecting clinical reasoning failures in frontier artificial intelligence using 10,000 synthetic cases

Auger, S. D.; Varley, J.; Hargovan, M.; Scott, G.

2026-04-23 neurology 10.64898/2026.04.22.26351488 medRxiv
Top 0.1%
28.8%

Background: Current medical large language model (LLM) evaluations largely rely on small collections of cases, whereas rigorous safety testing requires large-scale, diverse, and complex cases with verifiable ground truth. Multiple sclerosis (MS) provides an ideal evaluation model, with validated diagnostic criteria and numerous paraclinical tests informing differential diagnosis, investigation, and management. Methods: We generated synthetic MS cases with ground-truth labels for diagnosis, localisation, and management. Four frontier LLMs (Gemini 3 Pro/Flash, GPT 5.2/5 mini) were instructed to analyse cases to provide anatomical localisation, differential diagnoses, investigations, and management plans. An automated evaluator compared these outputs to the ground-truth labels. Blinded subspecialty experts validated 70 cases for realism and automated evaluator accuracy. We then evaluated LLM decision-making across 1,000 cases and scaled to 10,000 to characterise rare, catastrophic failures. Results: Subspecialist expert review confirmed 100% synthetic case realism and 99.8% (95% CI 95.5 to 100) automated evaluation accuracy. Across 1,000 generated MS cases, all LLMs successfully included MS in the differential diagnoses for more than 91% of cases. However, diagnostic competence did not associate with treatment safety. Gemini 3 models had low rates of clinically appropriate steroid recommendations (Flash: 7.2% 95% CI 5.6 to 8.8; Pro: 15.8% 95% CI 13.6 to 18.1) compared to GPT 5 mini (23.5% 95% CI 20.8 to 26.1), frequently overlooking contraindications like active infection. OpenAI models inappropriately recommended acute intravenous thrombolysis for MS cases (9.6% GPT 5.2; 6.4% GPT 5 mini) compared to below 1% for Gemini models. Expanded evaluation (to 10,000 cases) probed these errors in detail.
Thrombolysis was recommended in 10.1% of cases lacking symptom timing information and paradoxically persisted (2.9%) even when symptoms were explicitly documented as more than 14 days old. Conclusion: Automated expert-level evaluation across 10,000 cases characterised artificial intelligence clinical blind spots hitherto invisible to small-scale testing. Massive-scale simulation and automated interrogation should become standard for uncovering serious failures and implementing safety guardrails before clinical deployment exposes patients to risk.
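
Confidence intervals on validation proportions near 100%, like the 99.8% (95.5 to 100) evaluator accuracy above, need an interval that does not collapse at the boundary. A Wilson score interval is one standard choice (my choice of method for illustration; the authors may have used another):

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion; behaves
    sensibly even when the observed rate is at or near 100%."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# e.g. a perfect 70/70 on an expert-validated sample still leaves
# meaningful uncertainty about the true accuracy:
lo, hi = wilson_ci(70, 70)
print(round(lo, 3), round(hi, 3))  # 0.948 1.0
```

Note the lower bound stays well below 100% at n = 70, which is why large-n validation matters before trusting an automated evaluator across 10,000 cases.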

11
Validated Synthetic Data Generation from a Multicenter Spine Surgery Registry: Methodology and Benchmark

Challier, V.; Jacquemin, C.; Diebo, B.; Dehouche, N.; Denisov, A.; Cristini, J.; Campana, M.; Castelain, J.-E.; Lonjon, G.; Lafage, V.; Ghailane, S.; SpineDAO Collaborative Group,

2026-04-11 health informatics 10.64898/2026.04.07.26350316 medRxiv
Top 0.1%
28.7%

Background: Synthetic data have emerged as a complementary strategy for secondary use of clinical registries, enabling data sharing without patient-level exposure. In spine surgery, multicenter data sharing is constrained by institutional governance and patient privacy regulations. Validated synthetic data generation may enable broader access to surgical outcomes data for artificial intelligence development without compromising patient confidentiality. Objective: To describe and benchmark a three-domain validated synthetic data pipeline applied to a multicenter, tokenized spine surgery registry (SpineBase), and to establish a reproducible certification framework for synthetic spine surgery datasets. Methods: We extracted 125 sacroiliac joint fusion cases from the SpineBase registry (SIBONE study, IRB-SOFCOT approval Ref. 14-2025; CNIL MR-004 Ref. 2234503 v 0). A GaussianCopula generative model was trained on 52 structured variables spanning demographics, preoperative assessments, operative details, and longitudinal outcomes at 3, 6, 12, and 24 months. Synthetic datasets of 100, 1,000, and 10,000 patients were generated. Validation followed a three-domain framework: (1) fidelity, assessed by Kolmogorov-Smirnov tests and Jensen-Shannon divergence; (2) utility, assessed by train-on-synthetic, test-on-real (TSTR) methodology; and (3) privacy, assessed by nearest-neighbor distance ratio (NNDR), membership inference attack, and k-anonymity proxy. Results: All three validation gates passed. Fidelity: mean KS p-value 0.52 (threshold >0.05). Privacy: NNDR >1.0 in 98.9% of synthetic records; membership inference AUROC 0.57. Utility: 12-month Oswestry Disability Index prediction yielded Pearson r = 0.29, consistent with expected attenuation at N = 125. A SHA-256 cryptographic hash of each certified dataset was anchored on the Solana blockchain for immutable provenance. Conclusions: A validated, blockchain-anchored synthetic data pipeline for spine surgery registries is technically feasible and meets current publication-standard criteria for fidelity and privacy. Utility metrics scale with registry size, creating a direct incentive for multicenter data contribution. This framework provides a reproducible methodology for synthetic data certification in spine surgery research, and establishes certified synthetic datasets as a privacy-native substrate for expert-annotation pipelines -- as demonstrated in the companion Spine Reviews study.
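
The privacy gate above relies on the nearest-neighbor distance ratio. A minimal numpy sketch under one common definition (each synthetic record's distance to the nearest real record divided by its distance to the second nearest; definitions vary across papers):

```python
import numpy as np

def nndr(synthetic, real):
    """Nearest-neighbor distance ratio per synthetic record. Values
    near 1 suggest the record sits among several real records rather
    than memorizing any single one; values near 0 flag near-copies."""
    s = np.asarray(synthetic, float)
    r = np.asarray(real, float)
    # Pairwise Euclidean distances, shape (n_synthetic, n_real):
    d = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    two_smallest = np.sort(d, axis=1)[:, :2]
    return two_smallest[:, 0] / two_smallest[:, 1]

vals = nndr([[5.0, 5.0], [0.1, 0.0]],
            [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
print(vals)  # first record ~1.0 (safe), second ~0.01 (near-copy)
```

The registry's criterion of NNDR > 1.0 in 98.9% of records corresponds to almost no synthetic record being conspicuously closer to one real patient than to the rest.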

12
MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv
Top 0.1%
28.4%

MedSafe-Dx (v0) introduces a safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Eleven models were evaluated and revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
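
A composite pass rate of this kind can be sketched as a hard-failure filter: a case passes only if it triggers none of the three failure modes. The field names below are hypothetical illustrations, not the benchmark's actual schema:

```python
def safety_pass_rate(cases):
    """Fraction of cases exhibiting none of the three hard failure
    modes described in the benchmark. Field names are illustrative."""
    def passes(c):
        if c["life_threatening"] and c["decision"] != "Urgent":
            return False                     # missed escalation
        if c["confident"] and not c["diagnosis_correct"]:
            return False                     # overconfident wrong diagnosis
        if c["ambiguous"] and c["decision"] == "Routine" and c["reassured"]:
            return False                     # unsafe reassurance
        return True
    return sum(passes(c) for c in cases) / len(cases)

cases = [
    {"life_threatening": True, "decision": "Urgent", "confident": False,
     "diagnosis_correct": True, "ambiguous": False, "reassured": False},
    {"life_threatening": True, "decision": "Routine", "confident": False,
     "diagnosis_correct": True, "ambiguous": False, "reassured": False},
]
print(safety_pass_rate(cases))  # 0.5
```

Because any one failure mode fails the whole case, a model can score well on top-k diagnostic recall yet poorly here, which is the disconnect the benchmark is built to expose.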

13
cliexa-RA Implementation in Colorado Arthritis Center: A Case Study of Quadruple Aim Impacts

Kazgan, M.

2026-04-01 rheumatology 10.64898/2026.03.29.26349644 medRxiv
Top 0.1%
28.4%

Background: Digital health platforms can improve clinical efficiency and patient outcomes, but adoption in routine care remains limited due to workflow and integration challenges. Rheumatoid arthritis (RA) management relies on consistent capture of patient-reported and clinical data, which is often time-intensive and inconsistently documented. Objective: To assess the impact of the cliexa-RA digital platform on patient experience, physician workflow, and cost-related outcomes using the Quadruple Aim framework. Methods: A six-month pilot study was conducted at the Colorado Arthritis Center involving 300 RA patients. Patients completed a 16-question intake (RAPID3-based), followed by clinician-entered joint assessments. The platform generated five disease activity scores (DAS28-ESR, DAS28-CRP, SDAI, CDAI, RAPID3) and produced EMR-compatible outputs. Time metrics, patient satisfaction, and workflow efficiency were evaluated. Results: Mean patient intake time was 2.4 minutes, a 52% reduction compared to paper-based processes. Clinician time for calculation and documentation decreased by 77%, with near real-time EMR integration. Overall patient satisfaction was high (3.55/4), with 85% recommending the platform. Physicians reported improved documentation efficiency and workflow integration. Administrative cost reductions were observed through decreased reporting burden and improved compliance with quality reporting requirements. Conclusions: The cliexa-RA platform significantly improved efficiency and user experience in RA management. These findings support the role of integrated digital tools in reducing administrative burden and enabling scalable, data-driven care, with potential downstream benefits for cost and population health.
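
Two of the five disease activity scores the platform computes have simple closed forms. These are the standard published formulas, not cliexa-specific logic:

```python
from math import sqrt, log

def das28_esr(tjc28, sjc28, esr, gh):
    """DAS28-ESR: tender/swollen joint counts out of 28 joints,
    ESR in mm/h, patient global health on a 0-100 VAS.
    Bands: <2.6 remission, <=3.2 low, <=5.1 moderate, >5.1 high."""
    return (0.56 * sqrt(tjc28) + 0.28 * sqrt(sjc28)
            + 0.70 * log(esr) + 0.014 * gh)

def cdai(tjc28, sjc28, patient_global, evaluator_global):
    """Clinical Disease Activity Index: a simple sum; both global
    assessments are on 0-10 scales. No lab value required."""
    return tjc28 + sjc28 + patient_global + evaluator_global

print(round(das28_esr(tjc28=4, sjc28=2, esr=30, gh=50), 2))  # 4.6, moderate
print(cdai(tjc28=4, sjc28=2, patient_global=5, evaluator_global=4))  # 15
```

Automating exactly this kind of arithmetic plus EMR transcription is where the reported 77% reduction in clinician calculation and documentation time would come from.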

14
Artificial Intelligence Agents in Mental Health: A Systematic Review and Meta Analysis

Zhu, L.; Wang, W.; Liang, Z.; Tan, W.; Chen, B.; Lin, X.; Wu, Z.; Yu, H.; Li, X.; Jiao, J.; He, S.; Dai, G.; Niu, J.; Zhong, Y.; Hua, W.; Chan, N. Y.; Lu, L.; Wing, Y. K.; Ma, X.; Fan, L.

2026-04-22 psychiatry and clinical psychology 10.64898/2026.04.21.26351365 medRxiv
Top 0.1%
27.4%

The rapid rise of large language models (LLMs) and foundation models has accelerated efforts to build artificial intelligence (AI) agents for mental health assessment, triage, psychotherapy support and clinical decision assistance. Yet a gap persists between healthcare and AI-focused work: while both communities use the language of "agents," clinical research largely describes monolithic chatbots, whereas AI studies emphasize agentic properties such as autonomous planning, multiagent coordination, tool and database use and integration with multimodal mental health data streams. In this Review, we conduct a systematic analysis of mental health AI agent systems from 2023 to 2025 using a six-dimensional audit framework: (i) system type (base model lineage, interface modality and workflow composition, from rule-based tools to role-aware multi-agent foundation-model systems), (ii) data scope (modalities and provenance, from elicited self-report and chatbot dialogues to electronic health records, biosensing and synthetic corpora), (iii) mental health focus (mapped to ICD-11 diagnostic groupings), (iv) demographics (age strata, geography and sex representation), (v) downstream tasks (screening/triage, clinical decision support, therapeutic interventions, documentation, ethical-legal support and education/simulation) and (vi) evaluation types (automated metrics, language quality benchmarks, safety stress tests, expert review and clinician or patient involvement). 
Across this corpus, we find that most systems (1) concentrate on depression, anxiety and suicidality, with sparse coverage of severe mental illness, neurocognitive disorders, substance use and complex comorbidity; (2) rely heavily on text-based self-report rather than clinically verified longitudinal data or genuinely multimodal inputs; (3) are implemented as single-agent chatbots powered by general-purpose LLMs rather than role-structured, workflow-integrated pipelines; and (4) are evaluated primarily via offline metrics or vignette-based scenarios, with few prospective, clinician- or patient-in-the-loop studies. At the same time, an emerging class of agentic systems assigns foundation models explicit roles as planners, retrieval agents, safety auditors or supervisors coordinating other models and tools. These multiagent, tool-augmented workflows promise personalization, safety monitoring and greater transparency, but they also introduce new risks around reliability, bias amplification, privacy, regulatory accountability and the blurring of clinical versus non-clinical roles. We conclude by outlining priorities for the next generation of mental health AI agents: clinically grounded, role-aware multi-agent architectures; transparent and privacy-preserving use of clinical and elicited data; demographic and cultural broadening beyond predominantly Western adult samples; and evaluation pipelines that progress from offline benchmarks to longitudinal, real-world studies with routine safety auditing and clear governance of responsibilities between agents and human clinicians.

15
Self-Reported Symptoms Enable Four-Phase Menstrual Cycle Classification with Hormonally Validated Labels

Specht, B.; Tayeb, Z. Z.; Garbaya, S.; Khadraoui, D.; EL-Khozondar, M.; Schneider, R.

2026-04-01 health informatics 10.64898/2026.03.31.26349766 medRxiv
Top 0.1%
26.5%

Accurate inference of physiological state across the menstrual cycle has important applications in reproductive health and in understanding symptom dynamics, yet most non-hormonal approaches rely on wearable sensors or calendar-based tracking. Whether self-reported symptoms alone can support prospective, cross-subject phase classification remains unresolved. Here, we introduce a hybrid modelling framework that combines a gradient-boosted classifier with a Hidden Semi-Markov Model to infer four menstrual cycle phases (menstrual, follicular, fertile, and luteal) from self-reported data. The classifier captures non-linear symptom patterns, while the temporal model imposes biologically grounded constraints, including cyclic ordering and realistic phase durations. In a leave-one-subject-out evaluation using hormonally annotated data from 41 participants, the model achieved 67.6% accuracy and a macro F1 score of 0.662. Features reflecting short-term symptom variability were more informative than absolute symptom levels, indicating that within-person fluctuation provides a more generalisable signal of cycle phase than symptom intensity alone. These findings demonstrate the feasibility of low-burden, device-free menstrual health monitoring, establish symptom dynamics as a basis for scalable digital biomarkers, and expand access to tracking in resource-constrained settings and populations underserved by wearable-based approaches.

16
Counterfactual prediction of treatment effects on irregular clinical data using Time-Aware G-Transformers

Hornak, G.; Heinolainen, A.; Solyomvari, K.; Silen, S.; Renkonen, R.; Koskinen, M.

2026-04-02 health informatics 10.64898/2026.04.01.26349920 medRxiv
Top 0.1%
26.4%

Selecting an effective treatment relies on accurately anticipating a patient's response to alternative interventions. However, forecasting longitudinal clinical trajectories remains difficult because electronic health records contain heterogeneous, irregularly sampled data over extended time periods. These issues are especially relevant for laboratory measurements, which are central for diagnostics, assessment of therapeutic responses, and tracking disease progression in routine clinical practice. Yet existing deep learning methods for counterfactual prediction usually assume regularly sampled data, an assumption incompatible with the irregular, heterogeneous data-generation processes of real-world clinical practice. Here we present the Time-Aware G-Transformer, which integrates causal G-computation with time-aware attention to predict counterfactual outcomes on irregular data. By explicitly conditioning on the timing of future observations and encoding measurement patterns, the model captures temporal dynamics that previous methods overlook. Evaluated on synthetic tumor growth data and on 90,753 cancer patient trajectories from an academic medical center, our approach demonstrates superior long-horizon (> 1 day) prediction accuracy and uncertainty calibration compared to state-of-the-art baselines. These results demonstrate that embedding temporal relations directly into the attention mechanism enables robust integration of patient history data for evaluating potential treatment strategies in personalized medicine.
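"Time-aware attention" of the kind this abstract describes generally means injecting inter-event time gaps into the attention computation. The sketch below is a generic illustration of one common scheme (an additive penalty on the logits proportional to the time gap), not the paper's architecture; the linear-decay parameterisation is an assumption for clarity:

```python
import numpy as np

def time_aware_attention(q, k, v, times, decay=0.1):
    """Scaled dot-product attention with an additive penalty on the
    attention logits proportional to the time gap between events, so
    temporally distant observations receive less weight. Learned
    functions of the gap are common; linear decay is illustrative."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    gaps = np.abs(times[:, None] - times[None, :])  # |t_i - t_j|, e.g. in days
    logits = logits - decay * gaps                  # penalise distant events
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)              # softmax over keys
    return w @ v, w
```

With equal content logits, the time bias alone determines the weights, which makes the mechanism easy to verify: each query attends most strongly to its temporally nearest keys.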

17
Wearable-derived physiological features for trans-diagnostic disease comparison and classification in the All of Us longitudinal real-world dataset

Huang, X.; Hsieh, C.; Nguyen, Q.; Renteria, M. E.; Gharahkhani, P.

2026-04-13 epidemiology 10.64898/2026.04.07.26350352 medRxiv
Top 0.1%
26.3%

Wearable-derived physiological features have been associated with disease risk, but most current studies focus on single conditions, limiting understanding of cross-disease patterns. This study adopts a trans-diagnostic approach to examine whether wearable data capture shared and condition-specific physiological signatures across multiple chronic conditions spanning physical and mental health, and then evaluates the utility of these features for disease classification. A total of 9,301 patients with at least 21 days of consecutive Fitbit data from the All of Us Controlled Tier Dataset version 8 were analyzed. Disease subcohorts included cardiovascular disease (CVD), diabetes, obstructive sleep apnea (OSA), major depressive disorder (MDD), anxiety, bipolar disorder, and attention-deficit/hyperactivity disorder (ADHD), chosen based on prevalence and relevance. Logistic regression and XGBoost models were fitted for each disease subcohort versus the control cohort. We found that, compared to using only baseline demographic and lifestyle features, incorporating wearable-derived features improved classification performance in all subcohorts for both models, except for ADHD, where improvement was mainly observed for ROC-AUC in the logistic regression model, likely due to the smaller sample size in the ADHD subcohort. The largest performance gains were observed in MDD (increase in ROC-AUC of 0.077 for logistic regression, 0.071 for XGBoost; p < 0.001) and anxiety (increase in ROC-AUC of 0.077 for logistic regression, 0.108 for XGBoost; p < 0.001). This study provides one of the first comprehensive transdiagnostic evaluations of wearable-derived features for disease classification, highlighting their potential to enhance risk stratification in the real-world setting as a practical complement to clinical assessments and providing a foundation to explore more fine-grained wearable data.
Author summary: Wearable devices such as fitness trackers and smartwatches are becoming increasingly popular and affordable, providing continuous measurements of heart rate, physical activity, and sleep. Alongside the growing digitization of health records, this creates new opportunities for large-scale, real-world health studies. In this study, we analyzed wearable-derived physiological patterns across a range of chronic conditions spanning both physical and mental health to better understand how these signals relate to disease risk. We found that incorporating wearable-derived heart rate, activity and sleep features improved disease risk classification across several conditions, with particularly strong gains for major depressive disorder and anxiety. By examining how individual features contributed to model predictions, we also identified meaningful associations between physiological signals and disease risk. For example, both duration and day-to-day variation of deep and rapid eye movement (REM) sleep were associated with increased risk in certain conditions. Our study supports the development of real-time, automated tools to assess disease risk alongside clinical care.
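The ROC-AUC gains this study reports (the difference between a baseline-only model and one augmented with wearable features) can be reproduced in spirit with a small rank-based AUC function. The scores below are synthetic placeholders, not the study's data:

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a randomly chosen positive case
    is scored above a randomly chosen negative one (Mann-Whitney U
    form; ties count as half a win)."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Illustrative comparison: demographics-only vs demographics + wearable score.
y = np.array([0, 0, 0, 1, 1, 1])
baseline = np.array([0.2, 0.5, 0.4, 0.3, 0.6, 0.7])   # baseline features only
augmented = np.array([0.1, 0.3, 0.2, 0.6, 0.8, 0.9])  # + wearable-derived features
delta_auc = roc_auc(y, augmented) - roc_auc(y, baseline)
```

The rank formulation avoids explicitly sweeping thresholds and matches `sklearn.metrics.roc_auc_score` for binary labels.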

18
Spine Reviews: Crowdsourcing Global Spine Expert Knowledge via Digital Ledger Technology

Challier, V.; Diebo, B.; Lafage, V.; Dehouche, N.; Lonjon, G.; Cristini, J.; SpineDAO,

2026-04-13 health informatics 10.64898/2026.04.11.26350678 medRxiv
Top 0.1%
26.1%

Study Design: Prospective observational study using a novel digital ledger technology (DLT)-based crowdsourcing platform. Objective: To develop and evaluate Spine Reviews, a blockchain-based platform for aggregating spine treatment recommendations from an international specialist panel, and to validate the clinical coherence of the resulting dataset. Summary of Background Data: Predictive models for low back pain treatment are limited by small, homogeneous datasets that fail to capture inter-clinician variability. Traditional multi-center data collection is expensive, slow, and geographically constrained. DLT-based crowdsourcing with cryptographic credentialing may overcome these barriers. Methods: Five hundred synthetic patient vignettes (digital twins) were generated; 463 retained after quality control. A review platform was built on the Solana blockchain using non-transferable Soulbound Tokens (SBTs) for credentialing and smart-contract compensation. Fifty-two specialists from 7 countries provided 4+ reviews per vignette across four treatment tiers, without access to imaging or physical examination. Mixed-effects regression with reviewer random intercepts partitioned decision variability. Results: The platform collected 2,066 completed reviews (97.7%) over 37 days at USD 0.97/review. Variance decomposition revealed that 36.7% of treatment tier variability was attributable to patient presentation, 19.2% to reviewer practice style, and 44.1% to their interaction. Neurological deficits (beta=0.39), symptom duration (beta=0.12), and pain (beta=0.09) independently predicted treatment escalation (all p<0.001). Gwet's AC1 was almost perfect for emergency (0.92) and substantial for conservative decisions (0.67). Reviewer confidence in treatment recommendations decreased with escalating tier severity (conservative 4.59/5 vs surgical 4.05/5), suggesting appropriate uncertainty calibration. 
Conclusions: DLT with SBT credentialing enables rapid, global, cost-effective aggregation of clinically coherent expert judgment. The three-component variance structure quantifies clinical equipoise in spine care and establishes that predictive models require diverse, multi-reviewer training data.
Keywords: digital ledger technology; blockchain; crowdsourcing; clinical decision-making; low back pain; Soulbound Tokens
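Gwet's AC1, used above (and in the MSE benchmarking study below) to quantify inter-reviewer agreement, corrects observed agreement for chance in a way that is more stable than Cohen's kappa under skewed category prevalence. A two-rater sketch of the standard formula, not the study's own code:

```python
import numpy as np

def gwet_ac1(r1, r2):
    """Gwet's AC1 chance-corrected agreement for two raters giving
    categorical ratings to the same items (needs >= 2 categories)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    cats = np.union1d(r1, r2)
    K = len(cats)
    pa = np.mean(r1 == r2)  # observed proportion of agreement
    # average proportion of all ratings falling in each category
    pi = np.array([((r1 == c).mean() + (r2 == c).mean()) / 2 for c in cats])
    pe = (pi * (1 - pi)).sum() / (K - 1)  # chance-agreement probability
    return (pa - pe) / (1 - pe)
```

Unlike kappa, the chance term depends on how far category proportions are from uniform rather than on the raters' marginal distributions directly, so AC1 stays interpretable when one decision tier dominates (as with the near-ceiling emergency agreement of 0.92 reported above).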

19
Human vs AI Clinical Assessment: Benchmarking a Multimodal Foundation Model Against Multi-Center Expert Judgment on the Mental Status Examination.

Mwangi, B.; Jabbar Abdl Sattar Hamoudi, H.; Sanches, M.; Dogan, N.; Chaudhary, P.; Wu, M.-J.; Zunta-Soares, G. B.; Soares, J. C.; Martin, A.; Soutullo, C. A.

2026-04-20 psychiatry and clinical psychology 10.64898/2026.04.17.26351105 medRxiv
Top 0.1%
26.0%

The Mental Status Examination (MSE) is the cornerstone of the psychiatric evaluation, yet validating artificial intelligence (AI) against the inherent variance of clinical judgment remains a critical bottleneck. Here we introduce a multi-center framework to benchmark the open-weight multimodal foundation model Qwen3-Omni against independent expert panels at two sites, UTHealth and Yale. Evaluating 396 classifications across 10 MSE domains and three longitudinal timepoints of increasing symptom severity, we found that experts achieved substantial agreement (Gwet's AC1 = 0.87), whereas the model achieved only moderate alignment (AC1 = 0.70-0.72). Even as the model's overall pathology prediction rate approximated the experts', the aggregate equilibrium masked a profound "clinical reasoning gap". Specifically, the model systematically over-predicted observable signs (e.g., speech, affect) while notably failing in inferential domains requiring the interpretation of latent mental content (e.g., delusions, perceptions). A 4-bit quantization analysis of the model confirmed this mechanistically: reducing model capacity disproportionately degraded inferential reasoning while preserving perceptual feature extraction. Furthermore, model-to-expert agreement degraded linearly as clinical complexity intensified across longitudinal visits (Accuracy: T0 = 84.8-87%; T1 = 80-82%; T2 = 71-73%), whereas expert consensus remained robust. Notably, model errors increased 2.3- to 3.4-fold where human experts disagreed. These findings establish inter-expert variance as an essential measurable baseline for psychiatric AI, demonstrating that true clinical translation requires models to move beyond multimodal perceptual extraction to achieve higher-order diagnostic reasoning.

20
A case report on gendered biases in a Finnish healthcare AI assistant

Luisto, R.; Snell, K.; Vartiainen, V.; Sanmark, E.; Äyrämö, S.

2026-04-14 health informatics 10.64898/2026.04.09.26350383 medRxiv
Top 0.1%
25.8%

In this study, we investigate gender bias in a Retrieval-Augmented Generation (RAG) based AI assistant developed for Finnish wellbeing services counties. We tested the system using 36 clinically relevant queries, each rendered in three gendered variants (male, female, gender-neutral), and evaluated responses using both an LLM-as-a-judge approach and a human expert panel consisting of a physician and a sociologist specializing in ethics. We observed substantial and clinically significant differences across gendered variants, including differential treatment urgency, inappropriate symptom associations, and misidentification of clinical context. Female variants disproportionately framed responses around childcare and reproductive health regardless of clinical relevance, reflecting societal stereotypes rather than medical reasoning. Bias manifested both at the LLM generation stage and the RAG retrieval stage, in several cases causing the model to hallucinate responses entirely. Some bias patterns were persistent across repeated runs, while others appeared inconsistently, highlighting the challenge of distinguishing systematic bias from stochastic variation.
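The query-perturbation protocol described above (each clinical query rendered in three gendered variants, with responses compared for differential content) can be sketched as a simple template audit. The terms, template, and judge function below are illustrative placeholders, not the study's actual system or prompts:

```python
# Illustrative gendered terms; the study used Finnish-language queries.
GENDER_TERMS = {"male": "man", "female": "woman", "neutral": "person"}

def gendered_variants(template):
    """Render one clinical query template in three gendered variants,
    so the assistant's responses can be compared pairwise."""
    return {g: template.format(patient=term) for g, term in GENDER_TERMS.items()}

def divergent(responses, judge):
    """Flag a query whose responses differ across variants, according to
    a judge function (an LLM-as-a-judge or human panel in the study)."""
    texts = list(responses.values())
    return any(judge(texts[0], t) for t in texts[1:])
```

Because some bias patterns appeared only intermittently, an audit like this would be run over repeated generations per variant to separate systematic bias from stochastic variation, as the abstract notes.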